Abstract

In almost every scientific field, an experiment involves collecting data 
and then analysing it. The analysis stage will often consist in trying to 
extract some physical parameter and estimating its uncertainty; this is 
known as Parameter Determination. An example would be the determi- 
nation of the mass of the top quark, from data collected from high energy 
proton-proton collisions. A different aim is to choose between two possi- 
ble hypotheses. For example, are data on the recession speed s of distant 
galaxies proportional to their distance d, or do they fit better to a model 
where the expansion of the Universe is accelerating? 

There are two fundamental approaches to such statistical analyses - 
Bayesian and Frequentist. This article discusses the way they differ in 
their approach to probability, and then goes on to consider how this affects 
the way they deal with Parameter Determination and Hypothesis Testing. 
The examples are taken from every-day life and from Particle Physics. 

1 INTRODUCTION 

There are two fundamental approaches to statistical analysis, Bayesianism and
Frequentism. The Bayesian approach dates back to the Reverend Thomas Bayes,
whose paper was published posthumously in 1763. The Polish statistician Jerzy
Neyman played a crucial role in the development of frequentist statistics. In the
past there have been vigorous discussions about the relative merits of these two
methods (see fig. 2).

In this article, the fundamental differences between these two approaches will 
be explained, and illustrated with examples from Physics and from every-day 
life. We consider them in situations where we are trying to measure a parameter 
(e.g. the mass of the top quark), or are testing hypotheses (e.g. do we have
evidence for the existence of the Higgs boson?).



To appear in Contemporary Physics.



Figure 1: The Reverend Bayes (left), whose paper on his theorem was published
posthumously in 1763; and Jerzy Neyman, a Polish statistician who played a
crucial role in the development of the frequentist approach.



1.1 Why the fuss? 

Given that there are these fundamentally different ways of analysing data, how 
is it possible that many scientists spend a lifetime measuring all sorts of physical 
parameters, without being aware of the sharp differences of philosophy between 
the Bayesian and Frequentist approaches? The answer is that in the simplest
of problems the two methods (and others too, like $\chi^2$ or maximum likelihood)
can give the identical answer, that the probability that a parameter $\mu$ lies in the
range $\mu_l$ to $\mu_u$ is, say, 68%. By the 'simplest of problems', we mean that the
measured value $m$ is Gaussian distributed about the true value $\mu$ with known
variance $\sigma^2$, and that $\mu$ can in principle have any value from minus infinity to
plus infinity.

However, in many practical problems in Particle Physics, these conditions
are not satisfied. The parameter may be restricted in range (masses cannot
be negative), and the data distribution may not be Gaussian (counting
experiments often follow the Poisson distribution). So there is ample opportunity
for the results of Bayesian and frequentist analyses to differ. The two types of
statisticians have often had strong criticisms of each other's approach.

Figure 2: This incident from the World Cup soccer final in 2006 was used to
illustrate the 'discussions' that took place between sub-groups within a particular
experiment about the relative merits of Bayesian and Frequentist analyses [1].


1.2 Probability 

The differences between the Bayesians and Frequentists start with their
interpretation of 'probability'. Underpinning both of these is the mathematical
approach, which is largely due to Russian mathematicians such as Kolmogorov.
It is based on axioms (e.g. probability is a number in the range 0 to 1; the
sum of the probabilities for something to occur and for it not to occur is 1;
etc.). This is very important for manipulating probabilities, but provides little
physical intuition about the concept.

For frequentists, the probability p of 'something' is defined in terms of a large
number N of essentially identical, independent trials: if the specified 'something'
happens in s of these trials, p is defined as the limit of the ratio s/N, as N tends
to infinity. Thus the probability of the sum of the numbers on two rolled dice
adding up to 10 can be determined in this way to be 1/12.
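
The frequentist definition can be tried out numerically. The following is only an illustrative sketch: the function names, the fixed random seed and the numbers of trials are arbitrary choices made here, not part of the original discussion. It compares the exact value, obtained by counting the 36 equally likely outcomes, with the ratio s/N from simulated rolls.

```python
import random
from fractions import Fraction
from itertools import product


def exact_probability(target=10):
    """Exact P(sum of two fair dice == target) by enumerating all 36 outcomes."""
    outcomes = list(product(range(1, 7), repeat=2))
    hits = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(hits, len(outcomes))


def frequency_estimate(n_trials, target=10, seed=12345):
    """Frequentist estimate s/N from n_trials simulated rolls of two dice."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    s = 0
    for _ in range(n_trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == target:
            s += 1
    return s / n_trials


if __name__ == "__main__":
    print("exact:", exact_probability())
    for n in (100, 10_000, 1_000_000):
        print(n, frequency_estimate(n))
```

As N grows, the simulated ratio s/N should settle ever closer to the exact value, which is the sense in which the frequentist probability is a limit.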

Bayesians attack this definition, as it requires a large number of 'essentially
identical' trials. They claim that to determine whether the trials are indeed
'essentially identical' requires the concept of probability, and hence the definition
is circular.

Given that a repeated series of trials is required, frequentists are unable 
to assign probabilities to single events. Thus, with regard to whether it was 
raining in Manchester yesterday, there is no way of creating a large number 
of 'yesterdays' in order to determine the probability. Frequentists would say 
that, even though they might not know, in actual fact it either was raining or 
it wasn't, and so this is not a matter for assigning a probability. And the same 
remains true even if we replace 'Manchester' by 'the Sahara Desert'. 

Another example would be the unwillingness of a frequentist to assign a 
probability to the statement that 'the first astronaut to set foot on Mars will 
return to Earth alive.' This does not mean it is an uninteresting question, 
especially if you have been chosen to be on the first manned-mission to Mars, 
but then, don't ask a frequentist to assess the probability. 

A different type of example involves physical constants. Frequentists will
also not assign probabilities to statements involving the numerical values of
physical parameters, e.g. 'Does dark matter constitute more than 25% of
the critical density for our Universe?' This again is a situation which cannot be
checked by replicated tests. And again, it is either true or false, and not suitable
for frequentist probabilities. A similar argument applies to statements about
theories: a Frequentist will not allow probability assignments as to whether the
Higgs boson exists.

Bayesians have a very different approach. For them, probability is a personal 
assessment of how likely they think something is to be true. It depends on their 
own judgement and/or previous knowledge about the situation, and can hence 
vary from person to person. Thus if I toss a coin, and ask you what is the 
probability of the result being heads, you are likely to say 50%. But maybe 
I cheated and looked at the coin, and saw that it was tails, so for me the 
probability of heads is 0%. Or maybe I just gave it a quick glance, and think 
(but am not certain) that it was tails, so I assign a probability of 20% to heads. 

Because Bayesians have this personal view of probability, they would be 
prepared to give numerical estimates for 'one-off' situations (e.g. who gets this 
year's Nobel Prize?), for parameter values (e.g. fraction of dark matter), or 
concerning theories (e.g. existence of Higgs boson). Again, these numerical
assessments could vary from person to person.

PERSONAL PROBABILITIES

This is a story I originally heard from Nobel Prize winner Frank 
Wilczek in a slightly different context, but it illustrates the way that 
for Bayesians the assessment of probability can differ from person to 
person. 

A shy postdoc is attending a workshop on the topic of 'Extra Dimen- 
sions'. Each evening, after an intensive day's work, he goes to the local 
bar, sits next to an empty chair and orders two glasses of wine, one for 
himself and the other for the empty chair. By the third evening, the 
barman's curiosity cannot be controlled and he asks the postdoc why 
he always orders the extra glass of wine. 'I work on the theory of ex- 
tra dimensions', explains the postdoc, 'and it is possible that there are 
beautiful girls out there in 12 dimensions, and maybe by quantum me- 
chanical tunneling they might appear in our 3-dimensional world, and 
perhaps one of them might materialise on this empty chair, and I would 
be the first person talking to her, and then she might go out with me'. 
'Yes', says the barman, 'but there are three very attractive real girls 
sitting over there on the other side of this bar. Why don't you go and 
ask them if they would go out with you?' 'There's no point', replies the 
postdoc, 'that would be very unlikely.' 



It sounds as if this is very personal and not conducive to numerical estimates. 
But Bayesians' assessment of probability should be consistent with the 'fair bet' 
concept. If a Bayesian believes that a certain statement has a 10% probability of 
being true, they should be prepared to offer odds of 9 to 1 (or 1 to 9) to someone 
who is prepared to bet with them on this being true (or false, respectively). With
a poor assessment of the probability, they would be in danger of losing money. 
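
As a quick arithmetic check of the 'fair bet' idea (a sketch assuming the other party stakes one unit on the statement being true, and is paid 9 units if it is), the Bayesian's expected gain at odds of 9 to 1 is zero when their probability assessment of 10% is right:

```latex
\[
E[\text{gain}] = \underbrace{0.9 \times (+1)}_{\text{statement false: keep the stake}}
               + \underbrace{0.1 \times (-9)}_{\text{statement true: pay out 9}}
               = 0 .
\]
```

If the true probability were larger than the assessed 10%, the same odds would give a negative expected gain, which is the sense in which a poor assessment costs money.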



2 LIKELIHOODS, BAYES THEOREM AND PRIORS

We now have a relevant digression into considering likelihood functions, and 
then introduce Bayes Theorem and Priors, essential ingredients of the Bayesian 
approach. 



2.1 Likelihoods 

The likelihood approach is a very powerful one for parameter determination, 
and is also very much involved in Bayesian and Frequentist methods for this. 
Likelihood ratios are also used for checking which of two theories provides a 
better description of the data. 

The likelihood function is best illustrated by a simple example. Imagine we
are performing a counting experiment for some fairly rare process. For example,
we may be interested in the flux of cosmic ray showers with energies above
$10^{20}$ electron volts. We have a large detector of known area, and find $n_0$ high
energy showers (e.g. 2) when running the detector for one year. We want to
make a statement about the value of the actual flux $\mu$ and its uncertainty.

Assuming these cosmic rays are falling on earth at a constant rate, and
are independent of each other, if the true rate is $\mu$, the conditional probability
$P(n|\mu)$ of obtaining $n$ counts is given by the Poisson distribution as

$$P(n|\mu) = e^{-\mu}\mu^{n}/n! \qquad (1)$$

Then the likelihood is defined by replacing $n$ in the above formula by the
observed value $n_0$, i.e.

$$L(\mu|n_0) = e^{-\mu}\mu^{n_0}/n_0! \qquad (2)$$

This likelihood is regarded as a function of $\mu$ for the fixed data value $n_0$. (For
example, if we observe 2 events, the likelihood is $e^{-\mu}\mu^2/2$.) It is the probability
of observing the data, for different choices of $\mu$. Then the likelihood estimate
of a parameter $\mu$ is that which maximises the likelihood, i.e. it is the value of $\mu$
which maximises the probability of observing the actual data $n_0$. (In our case,
not surprisingly, the likelihood estimate of $\mu$ is simply $n_0$.) Values of $\mu$ for which
the likelihood is small are regarded as excluded, and the uncertainty on $\mu$ is
related to the width of the likelihood distribution.
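
The statement that the likelihood estimate equals the observed count follows from a short calculation, sketched here for the Poisson likelihood with $n_0 > 0$ (the symbol $\hat{\mu}$ for the estimate is notation introduced only for this sketch):

```latex
\begin{align*}
\ln L(\mu|n_0) &= -\mu + n_0 \ln\mu - \ln(n_0!) \\
\frac{d \ln L}{d\mu} &= -1 + \frac{n_0}{\mu} = 0
  \quad\Longrightarrow\quad \hat{\mu} = n_0 \\
\frac{d^2 \ln L}{d\mu^2} &= -\frac{n_0}{\mu^2} < 0
  \quad\text{(so the stationary point is a maximum)}
\end{align*}
```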



A POISSON PUZZLE? 

According to the Poisson distribution, if the expected number of
observations in a specified time is $\mu$, the probabilities $P(1|\mu)$ and $P(2|\mu)$
are

$$P(1|\mu) = \mu e^{-\mu}, \qquad P(2|\mu) = \mu^2 e^{-\mu}/2$$

For small $\mu$, these are approximately $\mu$ and $\mu^2/2$ respectively. Given the
fact that the probability for observing one rare event in the time interval
is $\mu$, why is the probability for observing two independent events equal
to $\mu^2/2$, rather than simply $\mu^2$, as perhaps expected from eqn. (4)?



It is really important not to confuse the Poisson probability $P(n|\mu)$ with the
likelihood function $L(\mu|n_0)$, even though eqns. (1) and (2) bear a remarkable
similarity (see footnote). The distinction should be easy in this case: $P(n|\mu)$ is a
function of the discrete variable $n$ at fixed $\mu$, while $L(\mu|n_0)$ is a function of the
continuous variable $\mu$ at fixed $n_0$ (see fig. 3). Furthermore, the $P(n|\mu)$ are real
probabilities, while the likelihood $L(\mu|n_0)$ cannot be interpreted as a probability
density (it does not transform as expected for a probability density if the parameter
is chosen, for example, as some function of $\mu$ rather than $\mu$ itself).

Footnote: The '!' symbol in eqns. (1) and (2) not only expresses surprise ('Wow! These
equations look very similar'), but it also denotes the factorial.

Figure 3: Illustration of the difference between the probability density
distribution for the integer variable $n$ and the likelihood function for the continuous
parameter $\mu$, for the Poisson distribution (see eqns. (1) and (2)). They involve
the same function of $n$ and $\mu$, but it is evaluated at fixed $\mu$ for the pdf, and at
fixed $n$ for the likelihood.



2.2 Bayes theorem 

If we consider two 'events' $A$ and $B$ (in the statistical sense), we can write the
probability $P(A \text{ and } B)$ of them both happening as

$$P(A \text{ and } B) = P(A|B)\,P(B) \qquad (3)$$

where $P(A|B)$ is the conditional probability of $A$ happening, given the fact that
$B$ has occurred. An example could be where we select a random day from last
year, and $A$ is that it was snowy in Oslo, and $B$ that it was a December day.
Then eqn. (3) says that the probability of choosing a snowy December
day is equal to the probability of it being snowy in December, multiplied by
31/365 (the probability of a random day being in December). If the probability
of $A$ occurring does not depend on whether $B$ has done so, eqn. (3) reduces to
the better-known result that

$$P(A \text{ and } B) = P(A)\,P(B), \quad \text{for independent } A \text{ and } B \qquad (4)$$



CONDITIONAL PROBABILITY 

Conditional Probability $P(A|B)$ is the probability of $A$, given the fact
that $B$ has happened. For example, the probability of obtaining a 4
on the throw of a die is 1/6; but if we accept only even results, the
conditional probability for a 4, given that the number is even, is 1/3.

Because $P(A \text{ and } B)$ is symmetric in $A$ and $B$,

$$P(A \text{ and } B) = P(A|B)\,P(B) = P(B|A)\,P(A) \qquad (5)$$

Then Bayes Theorem is derived from the second equality above:

$$P(A|B) = P(B|A)\,P(A)/P(B) \qquad (6)$$

i.e. it relates $P(A|B)$ to $P(B|A)$. (See section 2.4 for examples where these are
obviously not equal.)
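
As a small worked check of Bayes Theorem, using the die example from the box on conditional probability (a 4 is certainly even, so $P(\text{even}|4) = 1$):

```latex
\[
P(4\,|\,\text{even}) = \frac{P(\text{even}\,|\,4)\,P(4)}{P(\text{even})}
                     = \frac{1 \times \tfrac{1}{6}}{\tfrac{1}{2}} = \frac{1}{3}
\]
```

This reproduces the value of 1/3 obtained by direct counting.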

It should be stressed that Bayes Theorem itself is not controversial, and 
frequentists are willing to make use of it, provided the various probabilities are 
genuine frequentist ones. The controversy begins when Bayesians replace A by 
a theoretical parameter (and B is the observed data). The theorem then states 
that 

$$P(\mathrm{param}|\mathrm{data}) \propto P(\mathrm{data}|\mathrm{param}) \times P(\mathrm{param}) \qquad (7)$$

where $P(\mathrm{data}|\mathrm{param})$ is just the likelihood function; $P(\mathrm{param})$ is the Bayesian
prior density, and expresses what was known about the parameter before our
measurement; and $P(\mathrm{param}|\mathrm{data})$ is the Bayesian posterior probability density
for the parameter, and contains the information about the parameter obtained
by combining the prior information with that from our measurement.



BAYESIAN POSTERIORS 

Jim Berger says that he and his wife have professions that are similar, 
but with a small difference. He is a Bayesian Statistician and she is a 
fitness trainer. The similarity is that they both aim to optimise poste- 
riors, but while he wants to maximise them, she aims to minimise her 
clients' posteriors. 



The frequentist objection to this is that the prior and the posterior both refer 
to parameter values; while this is allowed for Bayesians, it is strictly forbidden 
in the frequentist approach. In addition to this, complications are caused by 
the need to choose a probability density for the prior. 

2.3 Bayesian priors 

In order to obtain the Bayesian posterior probability distribution from the
likelihood function, the latter needs to be multiplied by the Bayesian prior. There
are several flavours of Bayesians, who have different motivations for their choice
of prior. In this article, we will concentrate on evidence-based priors. So if
in our Poisson example of Section 2.1 there was a previous measurement of
$\mu$ which gave the result 5 ± 1, the prior might be chosen as a Gaussian in $\mu$,
centred on 5 with standard deviation 1 (and probably truncated at zero). Then
the posterior, assuming 2 observed counts, would be

$$P(\mu|n_0 = 2) \propto (e^{-\mu}\mu^2/2) \times (e^{-(\mu-5)^2/2}/\sqrt{2\pi}), \qquad (8)$$

where the first factor on the right is the likelihood $L(\mu|n_0 = 2)$, and the second
is the prior $\pi(\mu)$.
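
The product of likelihood and prior can be evaluated numerically. The sketch below is only illustrative: the grid range, the step size and the function names are choices made here, and the normalisation is a simple Riemann sum. It multiplies the Poisson likelihood for 2 observed counts by the Gaussian prior (5 ± 1, truncated at zero) and locates the maximum of the posterior.

```python
import math


def likelihood(mu, n0=2):
    """Poisson likelihood L(mu | n0) for n0 observed counts."""
    return math.exp(-mu) * mu**n0 / math.factorial(n0)


def prior(mu, mean=5.0, sigma=1.0):
    """Gaussian prior from the earlier 5 +/- 1 measurement, truncated at zero."""
    if mu < 0:
        return 0.0
    z = (mu - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_on_grid(mu_max=15.0, step=0.001):
    """Return (grid, posterior density normalised by a Riemann sum)."""
    grid = [i * step for i in range(int(round(mu_max / step)) + 1)]
    unnorm = [likelihood(mu) * prior(mu) for mu in grid]
    norm = sum(unnorm) * step
    return grid, [u / norm for u in unnorm]


grid, post = posterior_on_grid()
# Position of the posterior maximum on the grid
mode = grid[max(range(len(post)), key=post.__getitem__)]
print(f"posterior mode ~ {mode:.3f}")
```

The posterior maximum lies between the likelihood estimate (2) and the centre of the prior (5), and much closer to the latter, because in this example the prior is considerably narrower than the likelihood.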

This is all very well when previous data on $\mu$ exists. But what if our
measurement is ground-breaking, and essentially nothing is previously known about
$\mu$? How do we now choose the prior $\pi(\mu)$? The 'obvious' answer is to choose a
prior that is independent of $\mu$ (but zero for unphysical negative $\mu$), so as not to
favour any particular value of $\mu$. However, do we really believe, as implied by
the constant prior, that $\mu$ is as likely to be in a range of width 0.5 at some
enormously large value as in the range 0.1 to 0.6?

Another problem is that if we are aiming to use a flat prior to express our
ignorance about a parameter, it is not clear why we should choose one functional
form for the parameter rather than another. For example, if we are aiming to
provide a very precise measurement of the mass $m$ of the tau neutrino, should
we parametrise our ignorance of its mass by a flat prior in $m$, $m^2$, $\ln(m)$, etc.?
Basically priors may be not bad for parametrising prior knowledge, but are not
so good for prior ignorance.
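
The ambiguity can be made explicit with the usual change-of-variables rule for probability densities. In this sketch the symbols $u = m^2$, $v = \ln m$ and the constant $c$ are introduced only for illustration; a prior that is flat in $m$ is not flat in the other two variables:

```latex
\begin{align*}
\pi_m(m) &= c \\
\pi_u(u) &= \pi_m(m)\left|\frac{dm}{du}\right| = \frac{c}{2\sqrt{u}},
  \qquad u = m^2 \\
\pi_v(v) &= \pi_m(m)\left|\frac{dm}{dv}\right| = c\,e^{v},
  \qquad v = \ln m
\end{align*}
```

So 'ignorance' expressed as a flat prior in one variable corresponds to a definite preference for small or large values in another.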

2.4 $P(A|B) \neq P(B|A)$

Bayes Theorem relates the conditional probabilities P{A\B) and P{B\A). Peo- 
ple often confuse these two probabilities, and may erroneously think they are 
the same. Thus journalists or even Laboratory Spokespersons may incorrectly 
say that there is a 99.9% probability that some particle exists, rather than the 
correct statement that under the null hypothesis that it does not, the data are 
very unlikely. 

A convincing example of their difference is provided by a database containing 
just 2 pieces of information about everyone on Earth: their sex and whether or 
not they are pregnant. We extract a random person from the database, who 
turns out to be female. Given that the person is female, the chance of being 
pregnant is about 3%. We then extract another random person, who turns out 
to be pregnant. Given the fact that the person is pregnant, the probability that 
they are female is 100%. i.e. 

$$P(\mathrm{pregnant}|\mathrm{female}) \ll P(\mathrm{female}|\mathrm{pregnant}) \qquad (9)$$

Similarly, if you select a card randomly from a deck of 52, the probability of 
it being an ace, if it happens to be a spade, is 1/13; however, the probability of 
a spade, given that it is an ace, is 1/4. 
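
The card example can be checked by direct counting. This is a minimal sketch; the deck representation and the helper names are illustrative choices made here.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely (rank, suit) pairs


def is_ace(card):
    return card[0] == "A"


def is_spade(card):
    return card[1] == "spades"


def probability(event):
    """P(event) for one card drawn at random from the deck."""
    return Fraction(sum(1 for c in DECK if event(c)), len(DECK))


def conditional(event, given):
    """P(event | given) by counting within the cards that satisfy `given`."""
    given_cards = [c for c in DECK if given(c)]
    return Fraction(sum(1 for c in given_cards if event(c)), len(given_cards))


p_ace_given_spade = conditional(is_ace, is_spade)
p_spade_given_ace = conditional(is_spade, is_ace)
print(p_ace_given_spade, p_spade_given_ace)

# Bayes Theorem links the two conditional probabilities
bayes_rhs = p_spade_given_ace * probability(is_ace) / probability(is_spade)
print(bayes_rhs == p_ace_given_spade)
```

The two conditional probabilities differ because the probability of an ace (1/13) differs from the probability of a spade (1/4).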

2.5 A Bayesian example 

Imagine that you, a Bayesian, are betting on the results of coin flips. Each time 
you bet 'Heads', and for the first 5 flips it comes down 'Tails'. Given that the 
probability of being wrong 5 times is 3%, should you suspect it is not a fair 
coin? 

We regard this as a parameter-estimation problem, and want to see whether
the probability $p_h$ of 'Heads' is consistent with 0.5. The data (no heads in 5
spins) enables us to calculate the likelihood function, but in order to extract
the posterior probability as a function of $p_h$, we must multiply the likelihood
by a prior $\pi(p_h)$. Now if the person betting against us is a complete stranger,
we might assign a constant value for $\pi(p_h)$ in the range 0 to 1; then the posterior
is such that $p_h = 0.5$ looks unlikely. On the other hand, if it is our local
village priest, we are so convinced that he is honest, we use a delta function at
$p_h = 0.5$, and then even if the coin continues to fall down 'Tails', we will still
believe that it is fair. Thus our conclusion depends very much on which prior
we choose.
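
The dependence on the prior can be made concrete with a small numerical sketch. The function names, the midpoint integration and the use of a narrow Gaussian as a stand-in for the delta-function prior are all assumptions made here for illustration. For 5 tails in 5 flips the likelihood is proportional to (1 - p_h)^5.

```python
import math


def likelihood(p_h, n_tails=5):
    """Probability of n_tails tails and no heads, for heads probability p_h."""
    return (1.0 - p_h) ** n_tails


def flat_prior(p_h):
    """'Complete stranger': every p_h between 0 and 1 equally credible."""
    return 1.0


def sharp_prior(p_h, width=0.01):
    """Stand-in for the 'village priest' delta function: narrow peak at 0.5."""
    return math.exp(-0.5 * ((p_h - 0.5) / width) ** 2)


def posterior_prob_above(threshold, prior, n_points=100_000):
    """P(p_h >= threshold | data) by midpoint integration over 0..1."""
    step = 1.0 / n_points
    total = above = 0.0
    for i in range(n_points):
        p = (i + 0.5) * step
        weight = likelihood(p) * prior(p)
        total += weight
        if p >= threshold:
            above += weight
    return above / total


print("flat prior :", posterior_prob_above(0.5, flat_prior))
print("sharp prior:", posterior_prob_above(0.5, sharp_prior))
```

With the flat prior the posterior probability that p_h is at least 0.5 is (1/2)^6 = 1/64, so a fair or heads-biased coin looks unlikely; with the sharply peaked prior the posterior stays concentrated near 0.5 whatever five flips show.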

Given the freedom to select one's prior, it seems as if Bayesian intervals for
a parameter can be very dependent on this choice. But in some cases, the 'data
overwhelms the prior', and the result becomes very insensitive to the choice
of prior. For example, the mass of the intermediate vector boson ($Z^0$) was
measured at the LEP (Large Electron Positron) Collider at CERN. The result
was that the likelihood function was essentially a Gaussian at 91,188 MeV/$c^2$,
with a width of 2 MeV/$c^2$. A Bayesian now has to multiply this by the prior
probability density for the mass. However, any reasonable choice of prior
will vary very little over the range of a few parts in $10^5$, and so in this case the
posterior is essentially independent of the prior.



3 PARAMETER DETERMINATION: BAYESIAN APPROACH

We illustrate the Bayesian approach using a simple example of the determination
of the lifetime of some radioactive material. The probability density $p$ for a decay
at time $t$ is given by

$$p(t|\tau) = (1/\tau)\,e^{-t/\tau} \qquad (10)$$

where $\tau$ is the lifetime we want to estimate. We can estimate $\tau$ from a set of
decays at observed times $t_i$. To simplify the problem we assume we have only
one decay at time $t_1$ (which will not give us a very accurate estimate of $\tau$).
The likelihood is

$$L(\tau) = (1/\tau)\,e^{-t_1/\tau} \qquad (11)$$

and we have to multiply this by our choice of prior for $\tau$, to obtain the posterior
$p(\tau|t_1) \propto L(\tau)\,\pi(\tau)$. As usual, there is a choice for $\pi(\tau)$: it could be an
evidence-based prior derived from a previous measurement (in which case our
posterior and the resulting range for $\tau$ will be based not only on our measurement,
but also on the previous one), or it could be based on ignorance, theoretical
motivation, etc. Because in many cases the choice of prior is not unique, Bayesian
analyses are supposed to present results for several plausible priors, so as to
investigate the sensitivity of the result to the choice of prior.

Once the posterior is available, several options are available for determining
a range of preferred $\tau$ values at some chosen probability level $\gamma$, i.e.

$$\int_{\tau_l}^{\tau_u} p(\tau|t_1)\,d\tau = \gamma \qquad (12)$$

Possibilities include:

• A central range from $\tau_l$ to $\tau_u$ could be obtained by having probabilities of
$(1 - \gamma)/2$ below the range, and $(1 - \gamma)/2$ above it.

• The upper limit $\tau_{UL}$ is obtained by setting the limits of integration in eqn.
(12) from zero to $\tau_{UL}$.

• In a similar manner, a lower limit $\tau_{LL}$ is obtained, using integration limits
$\tau_{LL}$ and infinity.

• The shortest posterior range in $\tau$ containing probability $\gamma$ is also popular,
but does not correspond to the shortest range in the decay rate $1/\tau$,
or for other reparametrisations of the variable of interest.
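
The first three options can be sketched for one particular, assumed, choice of prior. Here the prior is taken as proportional to 1/tau, which is an assumption made only for this illustration and not a recommendation from the text (a prior flat in tau would give a posterior that cannot be normalised for a single decay). With that prior the posterior is (t1/tau^2) exp(-t1/tau), whose cumulative distribution is exp(-t1/tau), so the ranges follow in closed form. The function names are illustrative.

```python
import math


def posterior_cdf(tau, t1):
    """P(lifetime <= tau | t1) for the ASSUMED prior pi(tau) proportional to 1/tau.

    The corresponding posterior density is (t1 / tau**2) * exp(-t1 / tau).
    """
    return math.exp(-t1 / tau) if tau > 0 else 0.0


def quantile(prob, t1):
    """Value of tau below which a fraction `prob` (0 < prob < 1) of the posterior lies."""
    return t1 / (-math.log(prob))


def central_range(t1, gamma=0.68):
    """Range with probability (1 - gamma)/2 below it and (1 - gamma)/2 above it."""
    tail = (1.0 - gamma) / 2.0
    return quantile(tail, t1), quantile(1.0 - tail, t1)


def upper_limit(t1, gamma=0.90):
    """tau_UL such that the posterior integral from zero to tau_UL equals gamma."""
    return quantile(gamma, t1)


def lower_limit(t1, gamma=0.90):
    """tau_LL such that the posterior integral from tau_LL to infinity equals gamma."""
    return quantile(1.0 - gamma, t1)


t1 = 1.0  # observed decay time, in arbitrary units
print("central 68% range:", central_range(t1))
print("90% upper limit  :", upper_limit(t1))
print("90% lower limit  :", lower_limit(t1))
```

A different prior would change these numbers, which is why results for several plausible priors are usually compared.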



4 PARAMETER DETERMINATION: FREQUENTIST APPROACH

We now consider the frequentist approach for the same problem as in the pre- 
vious section. 

The Neyman construction is used to show on a plot of the parameter $\tau$
versus the data $t$ the likely values of $t$ for each $\tau$ (see fig. 4(a)). This is achieved
by using $p(t|\tau)$ of eqn. (10) for a given $\tau$ to select a region of $t$ such that the
integral of $p(t|\tau)$ over this range of $t$ is, say, 68% (see the footnote on central
intervals); an example is denoted by the horizontal line in the figure. By repeating
this procedure for all $\tau$, we obtain the 'confidence band'. In our example, the edges
of the band are defined by the straight lines $t = 0.17\tau$ and $t = 1.8\tau$. Finally we use
the actual observed value $t_1$ to read off the range of $\tau$ values ($\tau_l$ to $\tau_u$, which are
$t_1/1.8$ and $t_1/0.17$ respectively) for which $t_1$ is a likely observation. For larger
values of $\tau$, $t_1$ is too small to be likely, and similarly for smaller $\tau$, $t_1$ is too large.
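
The slopes 0.17 and 1.8 follow from the exponential distribution: leaving 16% probability on each side of the accepted region gives edges at -tau ln(0.84) and -tau ln(0.16). The sketch below reproduces the band, inverts it to obtain the range of tau, and checks by simulation how often that range contains the true lifetime; the function names, the seed and the number of trials are illustrative choices made here.

```python
import math
import random


def band_edges(tau, cl=0.68):
    """Central region of t for a given tau, with (1 - cl)/2 probability on each side."""
    tail = (1.0 - cl) / 2.0
    t_low = -tau * math.log(1.0 - tail)   # P(t < t_low)  = tail
    t_high = -tau * math.log(tail)        # P(t > t_high) = tail
    return t_low, t_high


def confidence_interval(t1, cl=0.68):
    """Range of tau whose accepted region of t contains the observed t1."""
    low_slope, high_slope = band_edges(1.0, cl)
    return t1 / high_slope, t1 / low_slope


def simulated_coverage(tau_true, n_trials=100_000, cl=0.68, seed=2024):
    """Fraction of simulated experiments whose interval contains tau_true."""
    rng = random.Random(seed)  # fixed seed, an arbitrary choice
    hits = 0
    for _ in range(n_trials):
        t1 = rng.expovariate(1.0 / tau_true)  # one decay time with mean tau_true
        tau_l, tau_u = confidence_interval(t1, cl)
        if tau_l <= tau_true <= tau_u:
            hits += 1
    return hits / n_trials


print("band slopes:", band_edges(1.0))
print("interval for t1 = 1:", confidence_interval(1.0))
print("coverage:", simulated_coverage(tau_true=3.0))
```

The simulated fraction should be close to 68% whatever true lifetime is chosen, which is the defining property of the construction.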

In a more plausible scenario where the data consisted of a set of observed
decay times $t_i$, the data statistic could be the mean of the $t_i$. Then the
confidence band would be narrower than in the figure, and the range of acceptable $\tau$
values would be shorter.

An important feature of this construction is that it does not require a prior
distribution for $\tau$, thus avoiding the possible ambiguity that that would have
introduced. Another point to note is that the frequentist approach does not
claim that the range $\tau_l$ to $\tau_u$ is probable. Nor does it make any statement about
different values within this range; it is merely that this is the range of $\tau$ values
for which the observed data is likely (at the chosen confidence level).

Fig. 4(b) shows a more interesting example. An experiment aims to measure
the temperature $T$ of the fusion reactor at the centre of the sun, by using a
month's running of a solar neutrino detector to estimate the neutrino flux $\phi$
from the sun. Assuming we know all about fusion processes and convection in
the sun, the properties of neutrinos, the performance of our detectors, etc., we
can construct at each $T$ a region in $\phi$ such that there is a 68% probability the
experimental result would lie in it. Then we use the actual measured flux $\phi_{obs}$
to determine the accepted range for $T$.

Footnote: This definition does not provide a unique range. The one we show has a
probability of 16% on either side of the shaded region, which is then known as a
central interval. An alternative would be to have the whole of the 32% on the left
hand side of the confidence interval; this would be useful for producing upper
limits on $\tau$.

Figure 4: The Neyman construction. (a) For the exponential parameter $\tau$,
the central confidence band between the lines $t = 0.17\tau$ and $t = 1.8\tau$ gives
the likely values (at the 68% level) of $t$ for each $\tau$. Then a vertical line at
the observed $t_1$ intersects the edges of the confidence band at $\tau_l$ and $\tau_u$, and
these define the frequentist range for $\tau$. (b) Here the theory parameter is the
temperature $T$ at the centre of the sun, and the data $\phi$ is the measured flux
of solar neutrinos, both in arbitrary units. A measurement of the flux $\phi_{obs}$
determines a range of temperatures ($T_l$ to $T_u$) at the sun's centre.

4.1 Coverage 

For repetitions of an experiment using a particular statistical analysis to de- 
termine a range for the parameter of interest, where the data sets differ from 
each other just by statistical fluctuations, the coverage is the fraction of the 
parameter's intervals that contain the true value of the parameter. This can be 
determined from Monte Carlo simulation, or in some simple cases analytically. 
Coverage is a property of the statistical technique that is used to construct the
intervals, and does not apply to a single measurement.

Techniques for which the coverage is equal to the nominal value (e.g. 68%) 
for all values of the parameter are said to have exact coverage. If the coverage 
drops below the nominal value, the method is said to under-cover. Frequentists
regard this as bad: if the actual coverage for determining the parameter is only
30% rather than the nominal 68%, just quoting the range for the parameter 
as determined by that method is likely to mislead a reader into believing that 
your result is more accurate than it really is. Over-coverage does not have this 
problem, but it suggests that maybe the confidence intervals produced by that 
method are more conservative (i.e. wider) than they need be. 

A particularly important property of the Neyman construction is that the
confidence intervals for the parameter have the property of not undercovering.
This property is not guaranteed for other techniques (e.g. Bayesian, $\chi^2$,
maximum likelihood, method of moments).



Figure 5: Coverage $C$ for Poisson parameter intervals, as determined by the
$\Delta(\ln L) = 0.5$ rule. Repeated trials (all using the same Poisson parameter $\mu$)
yield different values of the observation $n$, each resulting in a range $\mu_l$ to $\mu_u$ for
$\mu$; then $C$ is the fraction of trials that give ranges which include the value of
$\mu$ chosen for the trials. The coverage $C$ varies with $\mu$, and has discontinuities
because the data $n$ can take on only discrete integer values. For large $\mu$, $C$
seems to approach the expected 0.68.



Fig. 5 shows the coverage $C$ for the following situation. An experiment
is performed to determine the rate $\mu$ of some Poisson counting process,
and $n$ counts are observed. The statistical procedure chosen for determining a
68% range for $\mu$ is the likelihood method with the $\Delta(\ln L) = 0.5$ rule to define
the ends of the range. In envisaged repetitions of the experiment, $n$ will vary
according to a Poisson distribution with mean $\mu_0$. Then $C(\mu_0)$ is the fraction
of the resulting ranges for $\mu$ which include $\mu_0$. The likelihood method does
not have the frequentist guarantee of coverage, and indeed large under- and
over-coverage occur, especially at low $\mu_0$.
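
The coverage calculation can be sketched without Monte Carlo, by summing Poisson probabilities over those values of n whose likelihood interval contains the true rate. The code below is one possible implementation written for illustration; the bisection search, the helper names, and the treatment of n = 0 (where ln L = -mu has its maximum at the boundary mu = 0, so the range is taken as 0 to 0.5) are assumptions made here. The true rate mu0 must be positive.

```python
import math


def log_likelihood(mu, n):
    """ln L(mu | n) for a Poisson count n, dropping the constant -ln(n!)."""
    if n == 0:
        return -mu
    return -mu + n * math.log(mu)


def _bisect(func, lo, hi, iterations=200):
    """Root of func between lo and hi, assuming func changes sign there."""
    f_lo = func(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def delta_lnl_interval(n, delta=0.5):
    """Range of mu for which ln L is within `delta` of its maximum value."""
    if n == 0:
        return 0.0, delta
    target = log_likelihood(float(n), n) - delta  # maximum is at mu = n

    def g(mu):
        return log_likelihood(mu, n) - target

    lower = _bisect(g, 1e-12, float(n))
    upper = _bisect(g, float(n), n + 10.0 * math.sqrt(n) + 10.0)
    return lower, upper


def coverage(mu0, delta=0.5):
    """Probability that the interval from a Poisson(mu0) observation contains mu0."""
    n_max = int(mu0 + 10.0 * math.sqrt(mu0) + 20)  # terms beyond this are negligible
    total = 0.0
    for n in range(n_max + 1):
        lower, upper = delta_lnl_interval(n, delta)
        if lower <= mu0 <= upper:
            total += math.exp(-mu0 + n * math.log(mu0) - math.lgamma(n + 1))
    return total


for mu0 in (0.2, 0.6, 2.0, 10.0):
    print(mu0, round(coverage(mu0), 3))
```

Scanning mu0 over a fine grid would show the jumps in coverage that arise each time mu0 crosses the end of the interval belonging to some integer n.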

5 PARAMETER DETERMINATION: COMMON ISSUES

Here we discuss some issues that are common to both Frequentist and Bayesian 
approaches. 

5.1 Parameters with limited range 

Very often a physical parameter has meaning only over a limited range. For
example, the square of the mass of the neutrino ($m_\nu^2$) produced in nuclear beta
decay cannot be negative, the branching ratio for some particular decay mode
of an elementary particle must be between zero and one, etc. Bayesians can
incorporate this information by setting the prior for the parameter to zero in
the non-physical region. This ensures that the best estimate of a parameter or
an upper limit for it are guaranteed to be physical. In contrast, a frequentist
upper limit could well turn out to be unphysical, or the range for the parameter
could be empty (i.e. there was no physical value of the parameter for which the
data was likely); in general Particle Physicists are unhappy with such a situation.

For many years, experiments estimating $m_\nu^2$ had 'likelihood functions' that
maximised at negative values. Upper limits for $m_\nu^2$ were then usually derived
by Bayesian methods.

5.2 Interpretation of $\mu_u > \mu > \mu_l$

Both Bayesian and Frequentist methods of parameter determination end up
with a statement of the form $\mu_u > \mu > \mu_l$ at some probability level, but their
interpretations are very different.

For frequentists, the parameter $\mu$ is unknown, but it does have a true value
and, as discussed earlier, it is not suitable for probability statements. So the
probability refers to the range $\mu_l$ to $\mu_u$. If the experiment were to be repeated
many times, a series of ranges for $\mu$ would be obtained, and the probability
refers to what fraction of these ranges contain the true value; this is just the
coverage mentioned in Section 4.1. Thus frequentists regard the ends of the
range as random variables.

For Bayesians, $\mu_u$ and $\mu_l$ have been determined by the experimental analysis,
and are considered fixed; Bayesians do not want to be involved in deciding what
would have happened in hypothetical repetitions of the experiment. But they
are prepared to treat the unknown physical constant as if it were a random
variable, and for them the probability refers to the fraction of the Bayesian
posterior probability density for $\mu$ that is within the quoted range.

Table 1: Interpretations of $\mu_u > \mu > \mu_l$ at 68% confidence level

                          Bayes                        Frequentist
What is fixed?            $\mu_l$, $\mu_u$             $\mu$
What is variable?         $\mu$                        $\mu_u$, $\mu_l$
What does 68%             Single measurement:          Set of measurements:
probability apply to?     percentage of $\mu$'s        percentage of ranges
                          posterior in range           $\mu_l \to \mu_u$ that contain $\mu$

5.3 Dealing with systematics 

Very often, in trying to estimate a parameter, some other quantity involved in 
the analysis is not known exactly, and this can affect the deduced range for 
the parameter of interest. For example, in the original Reines and Cowan ex- 
periment [3] to discover the electron neutrino, a detector sensitive to neutrinos 
interacting in it was built close to a powerful nuclear reactor. However, there 
were also background processes which mimic the interactions of the reactor neu- 
trinos. Then the observed number of counts n is likely to be Poisson distributed 
with mean b + s: 

$$P(n) = e^{-(b+s)}\,(b+s)^n / n! \qquad (13)$$

where b is the expected background, and s is the signal rate. If b is precisely
known, s is the only unknown parameter, and can be determined essentially as
described earlier. But if there is some uncertainty in the expected value of b,
this results in a systematic uncertainty in the answer. Statisticians refer to b as
a nuisance parameter.

Bayesians tend to treat all parameters (i.e. those of physical interest and
nuisance parameters) in a similar manner. Thus, assuming that the background
b has been estimated in a subsidiary counting experiment as $m_0$ while the
result of the main measurement of s + b was $n_0$, they would start by writing
the likelihood for s and b as

$$L(s, b | n_0, m_0) = \left(e^{-(s+b)}\,(s+b)^{n_0} / n_0!\right) \times \left(e^{-b}\, b^{m_0} / m_0!\right) \qquad (14)$$

Next this is multiplied by the chosen prior $\pi(s, b)$ for s and b, to give the
posterior probability p(s, b) for the parameter of interest s and the nuisance
parameter for the background b. Then this is integrated (or 'marginalised')
over b to give the probability density just for the parameter of interest:

$$p(s) = \int p(s, b)\, db \qquad (15)$$

Finally the required parameter range is extracted from p(s), e.g. a central 68%
range.
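
The Bayesian recipe of Eqs. (14) and (15) can be sketched numerically. The code below is an illustration only: it assumes flat priors for s and b on a finite grid (the text does not specify the prior), and the grid limits, the function names and the example counts n0 = 12, m0 = 5 are hypothetical.

```python
import math

def log_likelihood(s, b, n0, m0):
    """Logarithm of Eq. (14): Poisson(n0 | s + b) times Poisson(m0 | b)."""
    return (-(s + b) + n0 * math.log(s + b) - math.lgamma(n0 + 1)
            - b + m0 * math.log(b) - math.lgamma(m0 + 1))

def marginal_posterior(n0, m0, s_max=30.0, b_max=20.0, steps=300):
    """Return (s_grid, p_s): p(s) of Eq. (15) for flat priors, normalised on the grid."""
    ds, db = s_max / steps, b_max / steps
    s_grid = [(i + 0.5) * ds for i in range(steps)]   # bin midpoints avoid log(0)
    b_grid = [(j + 0.5) * db for j in range(steps)]
    # Integrate (marginalise) over the nuisance parameter b for each s.
    p_s = [sum(math.exp(log_likelihood(s, b, n0, m0)) for b in b_grid) * db
           for s in s_grid]
    norm = sum(p_s) * ds
    return s_grid, [p / norm for p in p_s]

def central_interval(s_grid, p_s, level=0.68):
    """Central range: equal posterior probability left out in each tail."""
    ds = s_grid[1] - s_grid[0]
    lo_tail, hi_tail = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    cdf, lower, upper = 0.0, None, None
    for s, p in zip(s_grid, p_s):
        cdf += p * ds
        if lower is None and cdf >= lo_tail:
            lower = s
        if upper is None and cdf >= hi_tail:
            upper = s
    return lower, upper

s_grid, p_s = marginal_posterior(n0=12, m0=5)
print("central 68% range for s:", central_interval(s_grid, p_s))
```

A different choice of prior would change p(s) and hence the quoted range; the flat prior here is a convenience for the sketch, not a recommendation.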

In contrast, frequentists start from the probability density p(n, m|s, b) for
observing any n and m as

$$p(n, m | s, b) = \left(e^{-(s+b)}\,(s+b)^{n} / n!\right) \times \left(e^{-b}\, b^{m} / m!\right) \qquad (16)$$

The fully frequentist method consists in performing a Neyman construction to
produce a confidence belt for likely data (n, m) as a function of the parameters
(s, b). In analogy with the simpler problems discussed earlier, the actual data
($n_0$, $m_0$) is then used to read off the region in parameter space (s, b) for
which the data is likely. If a range just for s is desired, it could be taken as
the extrema of the (s, b) region, although this will give rise to overcoverage.

There are also various approximate methods, which are simpler than the 
full Neyman construction and which tend to produce less overcoverage (but for 
which the frequentist guarantee of coverage no longer applies). An example is 
the profile likelihood approach, in which the probability p(n, m|s, b) is replaced
by $p_{prof}(n, m | s, b_{best}(s))$, where $b_{best}(s)$ is the value of b which
maximises the probability for that value of s; because $b_{best}(s)$ is a function
of s, the profiled probability depends on the single parameter s, which simplifies
the problem.
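
For the two-Poisson problem of Eq. (16), $b_{best}(s)$ can be written in closed form, which makes a compact sketch possible. Setting the derivative of the log-probability with respect to b to zero gives the quadratic 2b^2 + (2s - n - m)b - ms = 0, whose positive root is used below. The range is then read off with the common approximation that twice the drop in log-likelihood from its maximum is at most 1; that rule is an assumption of this sketch and is not stated in the text. Function names and example counts are hypothetical, and both counts are assumed to be positive.

```python
import math

def b_best(s, n0, m0):
    """Value of b maximising Eq. (16) at fixed s (positive root of the quadratic)."""
    a = 2.0 * s - n0 - m0
    return (-a + math.sqrt(a * a + 8.0 * m0 * s)) / 4.0

def log_profile_likelihood(s, n0, m0):
    """Log of the profiled probability, a function of s only."""
    b = b_best(s, n0, m0)
    return (-(s + b) + n0 * math.log(s + b) - math.lgamma(n0 + 1)
            - b + m0 * math.log(b) - math.lgamma(m0 + 1))

def profile_interval(n0, m0, s_max=30.0, steps=3000):
    """Approximate range for s: all grid points with 2 * (max - logL) <= 1."""
    grid = [i * s_max / steps for i in range(steps + 1)]
    logs = [log_profile_likelihood(s, n0, m0) for s in grid]
    best = max(logs)
    inside = [s for s, v in zip(grid, logs) if 2.0 * (best - v) <= 1.0]
    return min(inside), max(inside)

print("b_best at s = 7:", b_best(7.0, 12, 5))
print("approximate range for s:", profile_interval(12, 5))
```

As the text notes, such approximate ranges do not carry the frequentist guarantee of coverage.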



PROFILE LIKELIHOOD

In many situations, the probability of observing a particular set of data
d depends not only on a parameter of physical interest $\phi$ (e.g. the
mass of the Higgs boson), but also on some other so-called nuisance
parameters $\nu$ (e.g. a scale factor for correcting jet energies as measured
in the detector). Then the likelihood $L(\phi, \nu | d)$ is a function of both
sets of parameters $\phi$ and $\nu$. In order to draw conclusions about $\phi$,
it is often helpful to consider the profile likelihood
$L_{prof}(\phi, \nu_{max}(\phi) | d)$, where for each value of $\phi$, the
nuisance parameters are chosen to maximise the full likelihood
$L(\phi, \nu | d)$, i.e. $\nu_{max}$ varies with $\phi$. Thus $L_{prof}$ is a
function just of $\phi$ but not of the nuisance parameters $\nu$, thereby
simplifying the problem of making inferences about the parameter of
interest, at the cost of losing some of the properties of the likelihood
function.

Rather than maximising L with respect to $\nu$, Bayesian methods tend
to marginalise, i.e. integrate the likelihood with respect to $\nu$, usually
after using priors for $\nu$ to convert L into a posterior probability
distribution for $\phi$.

For the case where L is a multi-dimensional Gaussian distribution such
as

$$L \propto \exp\{-(a\phi^2 + 2b\phi\nu + c\nu^2)\},$$

marginalisation over $\nu$ or profiling with respect to it will give the same
functional form for the modified likelihoods.
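
The Gaussian statement can be checked by completing the square in $\nu$. The short derivation below is a sketch that assumes c > 0 and, for the marginalisation, a flat prior in $\nu$; neither assumption is spelled out in the text.

```latex
% Exponent rewritten by completing the square in \nu (assumes c > 0):
a\phi^2 + 2b\phi\nu + c\nu^2
  = c\left(\nu + \frac{b\phi}{c}\right)^2 + \left(a - \frac{b^2}{c}\right)\phi^2 .

% Profiling: the maximum over \nu is at \nu_{max}(\phi) = -b\phi/c, so
L_{prof}(\phi) \propto \exp\left\{-\left(a - \frac{b^2}{c}\right)\phi^2\right\} .

% Marginalising with a flat prior in \nu: the Gaussian integral over \nu
% contributes only a constant factor \sqrt{\pi/c}, so
\int L \, d\nu \propto \sqrt{\frac{\pi}{c}}\,
  \exp\left\{-\left(a - \frac{b^2}{c}\right)\phi^2\right\} .
```

Both results are Gaussians in $\phi$ with the same coefficient $a - b^2/c$; they differ only by a factor that does not depend on $\phi$.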



Ref. contains a longer discussion of systematics, while Demortier[5] deals
with ways of incorporating systematics in both parameter determination and
hypothesis testing.



6 HYPOTHESIS TESTING 

Possibly more interesting than Parameter Determination is Hypothesis Testing. 
Here the issue is to decide which of two (or more) competing theories provides 
a better fit to some data. For example, was data collected at the Large Hadron 
Collider at CERN in the first half of 2012 more consistent with what is known as 
the Standard Model (S.M.) of Particle Physics without anything new, or with 
the production of the Higgs boson in addition to the known S.M. processes? 
(See fig. 6.)




Figure 6: The observed distribution in the CMS experiment for the effective
mass $m_{\gamma\gamma}$ of pairs of $\gamma$s produced in high energy
proton-proton collisions at CERN's LHC. If the Higgs boson exists and decays to
a pair of $\gamma$s, it could result in a peak centred on the mass of the Higgs.
Otherwise, the distribution is expected to be smooth. In the main plot, the
events are weighted according to their quality; the inset shows unweighted
events. The apparent peak around 125 GeV is part of the evidence for the
existence of a new particle, whose properties seem consistent with those
expected for a Higgs boson.



In Particle Physics, for reasons to be explained below, it is much more common
to use a frequentist method to decide. In other fields, Bayesian approaches
tend to be favoured. We discuss Bayesian methods briefly in section 7.

6.1 Frequentist approach 

The first task is to choose some data statistic t which will help distinguish
between the hypotheses. In the simple case of a counting experiment, where the
data consists just of the number of accumulated counts $n_0$ for a given amount
of running time, it could simply be $n_0$. Then in most cases new physics would
manifest itself in a larger number of counts when the expected rate is s + b,
than if there were just background; here s and b are the expected signal and
background rates respectively.

In more complicated cases, the data could consist of one or more histograms 
or multi-dimensional distributions. Then usually t is chosen as a likelihood ratio 
for the data, assuming the two hypotheses: 

$$t = L_1(H_1 | d) / L_0(H_0 | d) \qquad (17)$$

where $L_1$ is the likelihood for $H_1$ (the hypothesis of signal + background),
given the data, while $L_0$ is for the background only hypothesis $H_0$, given
the same data. When the hypotheses are completely specified without any free
parameters, they are known as 'simple hypotheses' and the above formulation
is satisfactory. Then the Neyman-Pearson lemma[5] says that if we choose $H_0$
based on the likelihood ratio being below some suitably defined cut-off, this will
guarantee that we will achieve the lowest rate for "Errors of the Second Kind"
(i.e. incorrectly selecting $H_0$ when $H_1$ is true), for a given rate for
"Errors of the First Kind" (i.e. rejecting $H_0$ when it is true).

If, however, one or more of the hypotheses involves free parameters ('com- 
posite hypotheses'), the Neyman-Pearson lemma does not apply. Nevertheless 
a form of the likelihood ratio, such as the ratio of profile likelihoods, is often 
used as a method that may well be nearly optimal. 

6.2 p-values

For the null hypothesis $H_0$, the expected distribution of our test statistic t
is $f_0(t)$. Then for a given observed value $t_{obs}$, the p-value is the
fractional area in the tail of $f_0(t)$ for t greater than or equal to
$t_{obs}$. For definiteness we consider the single-sided upper tail (assuming
that the alternative hypothesis yields larger values of $t_{obs}$), but lower or
2-sided tails could be appropriate in other cases.
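
For the counting experiment of Section 6.1, where the statistic is the observed number of counts, the upper-tail p-value is a sum of Poisson probabilities. The following is a minimal sketch assuming a precisely known expected background b > 0; the function names and example numbers are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k counts for the given mean (mean > 0)."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def p_value_upper_tail(n_obs, b, rel_tol=1e-15):
    """P(n >= n_obs) under the background-only hypothesis with mean b."""
    total, k = 0.0, n_obs
    while True:
        term = poisson_pmf(k, b)
        total += term
        # Beyond the mean the terms fall steadily; stop once they are negligible.
        if k > b and term <= rel_tol * total:
            return total
        k += 1

for n_obs in (3, 6, 10):
    print(n_obs, p_value_upper_tail(n_obs, b=3.0))
```

Summing the upper tail directly, rather than subtracting the lower tail from one, keeps precision when the p-value is very small.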

A small p-value means that the data are not very consistent with the hy- 
pothesis. Apart from the possibility that the cause of the discrepancy is new 
physics, it could be due to an unlikely statistical fluctuation, an incorrect imple- 
mentation of the hypothesis being tested, an inaccurate allowance for detector 
effects, etc. 

As more and more data are acquired, it is possible that a small (and perhaps
not physically significant) deviation from the tested null hypothesis could result
in $p_0$ becoming small as the data become sensitive to the small deviation.
For example, a set of particle decays may be expected to follow an exponential
distribution, but there might be a small background characterised by decays at
very short times, and which is not allowed for in the analysis. A small amount
of data might be insensitive to this background, whereas a large amount of data
might give a very small p-value for a test of exponential decay, even though the
background is fairly insignificant. With enough data, we may be able to include
physically motivated corrections to our naive $H_0$. The possibility of a
statistically significant but physically unimportant deviation has been mentioned
by Cox[7].

It is extremely important to realise that a p-value is the probability of ob- 
serving data like that observed or more extreme, assuming the hypothesis is 
correct. It is not the probability of the hypothesis being true, given the data. 
These are not the same - see section [531 

Many of the negative comments about p-values are based on the ease of
misinterpreting them. Thus it is possible to find statements that of all
experiments quoting p-values below 5%, and which thus reject $H_0$, many more
than 5% are wrong (i.e. $H_0$ is actually true). In fact, the expected fraction
of these experiments for which $H_0$ is true depends on other factors, and could
take on any value between zero and unity, without invalidating the p-value
calculation.
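
A little arithmetic shows how the 'other factors' enter. In the sketch below, which is an illustration rather than anything taken from the text, the fraction depends on the rejection level alpha, on the power of the experiments when $H_0$ is false, and on the assumed fraction of tested cases in which $H_0$ really is true. All three names are hypothetical inputs.

```python
def fraction_h0_true_among_rejections(alpha, power, frac_h0_true):
    """Of all experiments that reject H0, the expected fraction for which H0 is true."""
    false_rejections = alpha * frac_h0_true          # H0 true, but p < alpha
    true_rejections = power * (1.0 - frac_h0_true)   # H0 false and p < alpha
    return false_rejections / (false_rejections + true_rejections)

for frac in (0.0, 0.5, 0.9, 0.99, 1.0):
    result = fraction_h0_true_among_rejections(0.05, 0.8, frac)
    print(f"fraction of cases with H0 true = {frac:4.2f} -> {result:.3f}")
```

With alpha fixed at 5%, the result moves from zero to unity as the assumed fraction of true nulls goes from zero to one, while every individual p-value remains correctly calculated.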

6.3 p-values for two hypotheses 

With two hypotheses $H_0$ and $H_1$, we can define a p-value for each of them.
We adopt the convention that $H_1$ results in larger values for the statistic t
than does $H_0$. Then $p_0$ is defined as the upper tail of $f_0(t)$, the
probability density function (pdf) for observing a measured value t when $H_0$
is true. It is conventional to define $p_1$ by the area in the lower tail of
$f_1(t)$ (i.e. towards the $H_0$ distribution) - see Fig. 7(b), which shows the
probability densities for obtaining a value t of a data statistic, for
hypotheses $H_0$ and $H_1$. For a specific value $t_{obs}$, the p-values $p_0$
and $p_1$ correspond to the tail areas above $t_{obs}$ for the $H_0$ pdf, and
below $t_{obs}$ for $H_1$, respectively.* Then $t_{crit}$ is the critical value
of t such that its $p_0$ value is equal to a preset level $\alpha$ for rejecting
the null hypothesis.

The $p_1$-value when $t = t_{crit}$ is denoted by $\beta$, and the power of the
test is $1 - \beta$. The power is the probability that we successfully reject
the null hypothesis, assuming that the alternative is true. We expect the power
to increase as the signal strength in $H_1$ becomes stronger, and the pdfs for
$H_0$ and $H_1$ become more separated.
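
These definitions can be made concrete for the simple case, assumed here purely for illustration, in which the pdfs of t under the two hypotheses are Gaussians of equal width. The function name, the means and the choice alpha = 0.05 are hypothetical.

```python
from statistics import NormalDist

def two_hypothesis_summary(t_obs, mu0, mu1, sigma=1.0, alpha=0.05):
    """p0, p1, t_crit, beta and power for Gaussian pdfs centred at mu0 < mu1."""
    f0, f1 = NormalDist(mu0, sigma), NormalDist(mu1, sigma)
    p0 = 1.0 - f0.cdf(t_obs)            # upper tail of the H0 pdf
    p1 = f1.cdf(t_obs)                  # lower tail of the H1 pdf
    t_crit = f0.inv_cdf(1.0 - alpha)    # value of t at which p0 equals alpha
    beta = f1.cdf(t_crit)               # p1 evaluated at t = t_crit
    return {"p0": p0, "p1": p1, "t_crit": t_crit,
            "beta": beta, "power": 1.0 - beta}

for separation in (0.5, 3.0, 6.0):
    summary = two_hypothesis_summary(t_obs=1.5, mu0=0.0, mu1=separation)
    print(separation, {k: round(v, 4) for k, v in summary.items()})
```

As the separation of the two Gaussians grows, beta shrinks and the power approaches unity, as expected from the text.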

Depending on the separation of the two pdfs and on the value of the data
statistic t, several situations are now possible (see Table 2):

• $p_1$ is small, but $p_0$ acceptable. Then we accept $H_0$ and reject $H_1$,
i.e. we exclude the alternative hypothesis.

• $p_0$ is very small, and $p_1$ acceptable. Then we accept $H_1$ and reject
$H_0$. This corresponds to claiming a discovery.

• Both $p_0$ and $p_1$ are acceptable. The data are compatible with both
hypotheses, and we are unable to choose between them.

• Both $p_0$ and $p_1$ are small. The choice of decision is not obvious, but
basically both hypotheses should be rejected.

*If t is a discrete variable, such as a number of events, then 'above' is replaced by 'greater
than or equal to', and correspondingly for 'below'.

Figure 7: Expected distributions for a statistic t for $H_0$ = background only
(solid curves) and for $H_1$ = background plus signal (dashed curves). In (a),
the signal strength is very weak, and it is impossible to choose between $H_0$
and $H_1$. As shown in (b), which is for moderate signal strength, $p_0$ is the
probability according to $H_0$ of t being equal to or larger than the observed
$t_0$. To claim a discovery, $p_0$ should be smaller than some pre-set level
$\alpha$, usually taken to correspond to $5\sigma$; $t_{crit}$ is the minimum
value of t for this to be so. Similarly $p_1$ is the probability according to
$H_1$ for $t < t_0$. The exclusion region corresponds to $t_0$ in the 5% lower
tail of $H_1$. In (b) there is an intermediate "No decision" region. In (c) the
signal strength is so large that there is no ambiguity in choosing between the
hypotheses. In order to protect against a downward fluctuation in a situation
like (a) resulting in an exclusion of $H_1$ when the curves are essentially
identical, $CL_s$ is defined as $p_1/(1 - p_0)$ (see Section 6.6).



$5\sigma$ DISCOVERY, 95% EXCLUSION

Searches for new phenomena in Particle Physics typically choose the
'Standard Model' as the null hypothesis $H_0$, and a specific form of
New Physics as $H_1$. The exclusion level for $H_1$ is usually set at 5%,
whereas that for rejecting $H_0$ (and perhaps claiming the discovery of
New Physics) is usually '$5\sigma$', i.e. $p_0 < 3 \times 10^{-7}$.

Some (not very convincing) reasons for the stringent criterion for
rejecting $H_0$ include:

• The past history of false discovery claims at $3\sigma$ and $4\sigma$ levels.

• The possibility that systematic effects have been underestimated.

• The Look Elsewhere Effect (see Section 6.5).

• Subliminal Bayesian reasoning that the Standard Model is intrinsically
more likely to be true than some specific speculation about
New Physics.

• The embarrassment of having to withdraw a spectacular but
incorrect claim of discovering New Physics.

In contrast, incorrect exclusion of New Physics is not regarded as so
dramatic, and so the weaker criterion of 5% is used. As Glen Cowan
has remarked, "If you are looking for your car keys and are 95% sure
they are not in the kitchen, it's a good idea to start looking somewhere
else"[8].



Fig. 8(a) illustrates the ($p_0$, $p_1$) plot for defining various decision regions.



Table 2: Choosing between two hypotheses, based on $p_0$ and $p_1$.

$p_0$        $p_1$    Result           If $H_0$ true        If $H_1$ true
Very small   O.K.     Discovery        Error of 1st kind    Correct choice
O.K.         Small    Exclude $H_1$    Correct choice       Error of 2nd kind
O.K.         O.K.     Make no choice   Loss of efficiency   Loss of efficiency
Very small   Small    ?                ?



6.4 p-values or likelihoods?

Rather than calculating p-values for the various hypotheses, we could use their
likelihoods $L_0$ and $L_1$. While p-values use tail areas beyond the observed
statistic, the likelihood is simply the height of the pdf at $t_{obs}$. We
return to likelihood ratios in Section 7.4.

As mentioned in section 6.1, the Neyman-Pearson lemma provides the best
way of choosing between two simple hypotheses, but even when one or both
hypotheses contain free parameters, the likelihood ratio may well be a suitable
statistic for summarising the data and for helping choose between the
hypotheses. In general, it will be necessary to generate the expected
distributions of the likelihood ratio according to the hypotheses $H_0$ and
$H_1$, in order to make some deduction based on the observed likelihood ratio;
for composite hypotheses there are of course the complications caused by the
nuisance parameters. The decision process may well be based on the p-values
$p_0$ and $p_1$ for the two hypotheses (see fig. 8). In that case, the procedure
can be regarded as either a likelihood ratio approach, with the p-values simply
providing a calibration for the value of the likelihood ratio; or as a p-value
method, with the likelihood ratio merely being a convenient statistic.

Figure 8: Plots of $p_0$ against $p_1$ for comparing a data statistic t with
two hypotheses $H_0$ and $H_1$, whose expected pdfs for t are assumed to be two
Gaussians of peak separation $\Delta\mu$, and of equal width $\sigma$ (see
fig. 7). (a) For a pair of pdfs with a given separation, the allowed values of
($p_0$, $p_1$) lie on a curve or straight line (shown solid in the diagram). As
the separation $\Delta\mu$ increases, the curves approach the $p_0$ and $p_1$
axes. Rejection of $H_0$ is for $p_0$ less than, say, $3 \times 10^{-7}$; here
it is shown as 0.05 for ease of visualisation. Similarly exclusion of $H_1$ is
shown as $p_1 < 0.1$. Thus the ($p_0$, $p_1$) square is divided into four
regions: the largest rectangle is when there is no decision, the long one above
the $p_0$-axis is for exclusion of $H_1$, the high one beside the $p_1$-axis is
for rejection of $H_0$, and the smallest rectangle is when the data lie between
the two pdfs. For $\Delta\mu/\sigma = 3.33$, there are no values of
($p_0$, $p_1$) in the "no decision" region. In the $CL_s$ procedure, rejection
of $H_1$ is when the statistic t is such that ($p_0$, $p_1$) lies below the
diagonal dotted straight line. (b) Contours of constant likelihood ratio
$r = L_0/L_1$ for Gaussian pdfs. The upper right region is inaccessible; the
diagonal line from (0,1) to (1,0) corresponds to the pdfs lying on top of each
other, i.e. no sensitivity. The diagonal through the origin is when $t_{obs}$ is
mid-way between the two pdfs. With larger separation of the Gaussian pdfs and
for constant $p_0$ the likelihood ratio increases.

6.5 Look Elsewhere Effect 

If you are playing cards, and in your hand of 13 cards you observe that you have 
4 queens, you might think that that is very unusual. Indeed the probability of 
a random set of 13 cards containing 4 queens is 0.0026. However, since you 
decided that '4 queens' was unusual only after you looked at your cards, you 
might have been equally surprised by 4 kings; or 4 jacks; or ace, two, three, four 
of the same suit; etc. Taking these into account, the probability of a surprising 
hand of cards similar to what we had is going to be a fair bit larger than 0.0026. 
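
The quoted probability, and the size of the dilution for one particular definition of 'surprising', can be checked by direct counting. In this sketch the wider definition is taken, as an assumption, to be 'all four cards of any one rank'; the runs within a suit mentioned above are not included. Function names are hypothetical.

```python
from math import comb

def prob_four_of_given_rank(hand=13, deck=52):
    """Probability that a random hand contains all 4 cards of one specified rank."""
    return comb(deck - 4, hand - 4) / comb(deck, hand)

def prob_four_of_any_rank(hand=13, deck=52, ranks=13):
    """Probability of all 4 cards of at least one rank (inclusion-exclusion)."""
    total = 0
    for k in range(1, min(ranks, hand // 4) + 1):
        sign = 1 if k % 2 == 1 else -1
        # Choose k ranks to be complete, then fill the rest of the hand freely.
        total += sign * comb(ranks, k) * comb(deck - 4 * k, hand - 4 * k)
    return total / comb(deck, hand)

print(f"4 queens:          {prob_four_of_given_rank():.4f}")
print(f"4 of any one rank: {prob_four_of_any_rank():.4f}")
```

Even this modest widening of what counts as surprising raises the probability by roughly a factor of 13.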

A similar effect explains why a seemingly improbable event in our every-day 
life (e.g. a chance meeting with someone we had been thinking about recently) 
may in fact be much more likely, if we have not decided at the beginning of the 
day that this specific event would be a real coincidence if it happened. 

Often in High Energy Physics, we are looking for some new phenomenon.
Thus we may be searching for a new particle, whose mass is not pre-defined, in
a histogram such as that of fig. 6. When we observe an enhancement, we can
calculate the p-value (the chance of observing a statistical fluctuation at least
as big as the one in our data, assuming that no such particle in fact exists),
at the observed mass. But this of course underestimates the chance of having
a fluctuation anywhere in our mass distribution, which we might mistakenly
ascribe to a new particle. We thus need to calculate the probability of observing
an effect at least as impressive as ours, anywhere in our mass distribution. In
Particle Physics, this dilution of the significance is known as the Look Elsewhere
Effect (LEE).
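
The size of the dilution can be estimated very crudely by assuming that the search covers a number of independent regions, each equally likely to fluctuate. This independence assumption, the function name and the example numbers are all simplifications introduced here; they are not the method used in real analyses.

```python
import math

def global_p_value(p_local, n_regions):
    """Chance of at least one fluctuation with p <= p_local among n independent regions."""
    # Equivalent to 1 - (1 - p_local) ** n_regions, written to stay accurate
    # for very small p_local. Requires 0 <= p_local < 1.
    return -math.expm1(n_regions * math.log1p(-p_local))

for n_regions in (1, 10, 50):
    print(n_regions, global_p_value(1e-3, n_regions))
```

For small local p-values the result is close to n_regions times p_local, so the choice of what counts as 'Elsewhere' feeds directly into the corrected significance.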

Similar considerations are relevant for searches in other fields. Thus claims 
for discoveries of gravitational waves would need to calculate the chance of a 
statistical fluctuation mimicking the observed effect not only at the observed
time, frequency and signal structure, but anywhere in the whole dynamic range 
of these variables for which a real signal is possible. 

Of course, specifying exactly where 'Elsewhere' is, is fraught with ambiguities.
Thus for the above example of searching for a new particle, fig. 6 is relevant
for the possibility of it decaying to 2 photons, but other decay modes could be
possible, and hence could be relevant to the LEE. Similarly, maybe the particle
we are considering cannot be produced enough at high masses for us to have
the chance of detecting it, so the whole mass range is not relevant for the LEE.
The conclusion is that when p-values are corrected for the LEE, it is important
to specify exactly what has been taken into account.

6.6 $CL_s$

The $CL_s$ method[9, 10] was introduced in the LEP experiments at CERN in
searches for new particles. When evidence for such a particle is not found, the
traditional frequentist approach is to exclude its production if $p_1$ is
smaller than some preset level $\gamma$, which in Particle Physics is typically
set at 5%. However, there is then a 5% probability that $H_1$ could be excluded,
even if the experiment was such that the $H_0$ and $H_1$ pdfs lay on top of each
other, i.e. there was no sensitivity to the production of the new phenomenon. To
protect against this, the decision to exclude $H_1$ is based on
$p_1/(1 - p_0)$, known as $CL_s$.^ It is thus the ratio of the left hand tails
of the pdfs for $H_1$ and $H_0$. Fig. 8(a) shows a ($p_0$, $p_1$) region for
which $H_1$ is excluded by $CL_s$. The fact that it is clearly smaller than for
the standard frequentist exclusion region is the price one has to pay for the
protection it provides against excluding $H_1$ when an experiment has no
sensitivity to it. We regard it as conservative frequentist.
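
The protection offered by $CL_s$ is easy to see numerically in the case, assumed here for illustration, of two equal-width Gaussian pdfs. The function name and the example values are hypothetical.

```python
from statistics import NormalDist

def p1_and_cls(t_obs, mu0, mu1, sigma=1.0):
    """Return (p1, CLs) for Gaussian pdfs centred at mu0 (H0) and mu1 (H1)."""
    p0 = 1.0 - NormalDist(mu0, sigma).cdf(t_obs)   # upper tail of the H0 pdf
    p1 = NormalDist(mu1, sigma).cdf(t_obs)         # lower tail of the H1 pdf
    return p1, p1 / (1.0 - p0)                     # ratio of the two left hand tails

# No sensitivity: identical pdfs and a downward fluctuation of the data.
print("no sensitivity:  ", p1_and_cls(t_obs=-2.0, mu0=0.0, mu1=0.0))
# Good sensitivity: well separated pdfs, data at the H0 peak.
print("good sensitivity:", p1_and_cls(t_obs=0.0, mu0=0.0, mu1=4.0))
```

In the first case $p_1$ is below 5%, so the standard criterion would exclude $H_1$, whereas $CL_s$ equals 1 and no exclusion is made. In the second case both quantities are far below 5%.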

It is interesting that the $CL_s$ exclusion line in fig. 8(a) for the case of two
Gaussians is identical to that obtained by a Bayesian procedure for determining
the upper limit on $\mu_1$ when the latter is restricted to positive values, and
with a uniform prior for $\mu_1$. In a similar manner, the standard frequentist
procedure agrees with the Bayesian upper limit when the restriction of $\mu_1$
being positive is removed.

In principle, similar protection against discovery claims when the experiment
has no sensitivity could be employed, but it is deemed not to be necessary
because of the different levels used for discovery and for exclusion of $H_1$
(typically $3 \times 10^{-7}$ and 0.05 respectively).

6.7 When neither $H_0$ nor $H_1$ is true

It may well be that neither $H_0$ nor $H_1$ is true. With no more information
available, it is of course impossible to say what we expect for the distribution
of our test statistic t. On the plot of fig. 8(a), our data may fall in the small
rectangle next to the origin.

It is certainly not true that a small value for $p_0$ necessarily implies that
$H_1$ is correct, although for small enough $p_0$, ruling out $H_0$ is a
possibility.

7 BAYESIAN METHODS 

The Bayesian approach is more naturally suited to making statements about
what we believe about two (or more) hypotheses in the light of our data. This
contrasts with Goodness of Fit, which involves considering other possible data
outcomes, but focusses on just one hypothesis.

^This stands for 'confidence level of signal', but it is a poor notation, as $CL_s$ is in fact a
ratio of p-values, which is itself not even a p-value, let alone a confidence level.

All Bayesian methods involve the likelihood function, possibly modified to
take into account nuisance parameters. For Hypothesis Testing, some form of a
ratio of (modified) likelihoods is usually involved. For simple hypotheses, this
is just $L_0(H_0)/L_1(H_1)$, where $L_i(H_i) = p(x|H_i)$, the probability
(density) for observing data x for the hypothesis $H_i$. The issue is going to
be how nuisance parameters^ are dealt with for non-simple hypotheses. For the
likelihood approach (as opposed to the Bayesian one, which also requires
priors), it is usual to profile them, i.e. the profile likelihood is
$L_i(H_i | \nu_{best})$, where $\nu_{best}$ is the set of parameters which
maximise L. In Particle Physics, the profile likelihood approach is a popular
method for incorporating systematics in parameter determination problems.

The complications of applying Bayesian methods to model selection in practice
are due to the choices for appropriate priors. This is particularly so for
those parameters which occur in the alternative hypothesis $H_1$ but not in the
null $H_0$.

Loredo[11] and Trotta[12] have provided reviews of the application of Bayesian
techniques in Astrophysics and Cosmology, where their use is more common than
in Particle Physics.

7.1 Bayesian posterior probabilities 

When there are no nuisance parameters involved, the ratio of the posterior
probabilities for the two hypotheses is $p_{post}(H_0|x) / p_{post}(H_1|x)$,
where

$$p_{post}(H_i|x) = L_i(H_i)\, \pi_i \qquad (18)$$

and $\pi_i$ is the assigned prior probability for hypothesis i. For example, the
hypothesis of there being a Higgs boson of mass 110 GeV might well be assigned
a small prior, in view of the exclusion limits from LEP.

With nuisance parameters, the posterior probabilities become

$$p_{post}(H_i|x) = \pi_i \int L_i(H_i, \nu)\, \pi_i(\nu)\, d\nu \qquad (19)$$

where $\pi_i$ is the prior probability for hypothesis i and $\pi_i(\nu)$ is the
joint prior for its nuisance parameters, i.e. we now have integrated over the
nuisance parameters. This contrasts with the likelihood method, where
maximisation with respect to them is more usual. Even with $\pi_i(\nu)$ being a
constant, integration and maximisation can select different regions of parameter
space. An example of this would be a likelihood function that has a large narrow
spike at small $\nu$, and a broad but lower enhancement at large $\nu$.

^For the purpose of model comparison, any parameters are considered as nuisance
parameters, even if they are physically meaningful, e.g. the parameters of a
straight line fit, the mass of the Higgs boson, etc.
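
The difference between integrating and maximising over a nuisance parameter $\nu$ can be made concrete with a toy likelihood consisting of a large narrow spike at small $\nu$ plus a broad, lower enhancement at large $\nu$. The shape below (heights, positions and widths of the two Gaussians) is invented purely for illustration, and a constant prior in $\nu$ is assumed.

```python
import math

def toy_likelihood(nu):
    """Narrow spike (height 10, width 0.01) at nu = 1, broad bump (height 1, width 1) at nu = 8."""
    spike = 10.0 * math.exp(-0.5 * ((nu - 1.0) / 0.01) ** 2)
    bump = 1.0 * math.exp(-0.5 * ((nu - 8.0) / 1.0) ** 2)
    return spike + bump

def compare_maximum_and_integral(nu_max=15.0, steps=150000):
    """Return (nu at the maximum, fraction of the integral that lies below nu = 2)."""
    d_nu = nu_max / steps
    grid = [(i + 0.5) * d_nu for i in range(steps)]
    values = [toy_likelihood(nu) for nu in grid]
    nu_at_maximum = grid[values.index(max(values))]
    total = sum(values) * d_nu
    spike_share = sum(v for nu, v in zip(grid, values) if nu < 2.0) * d_nu / total
    return nu_at_maximum, spike_share

nu_at_maximum, spike_share = compare_maximum_and_integral()
print(f"maximum at nu = {nu_at_maximum:.3f}")
print(f"share of the integral from the spike region = {spike_share:.3f}")
```

Maximisation picks the spike, while about nine tenths of the integral comes from the broad enhancement, so the two procedures are effectively dominated by different regions of $\nu$.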
In relation to all Bayesian methods, it is to be emphasised that the choice
of a constant prior, especially for multi-dimensional $\nu$, is by no means
obvious (compare Section 2.3). Very often, there are several possible choices of
variable for the nuisance parameters, with none of them being obviously more
natural or appropriate than the others. Thus a point in 2-dimensional space
could be written as Cartesian (x, y) or polar (r, $\theta$); constant priors in
the two sets of variables are different. Similarly in fitting data by a straight
line y = a + bx, using a seemingly innocuous flat prior for $b = \tan\theta$
results in angles $\theta$ in the range 0° to 89° having the same prior
probability as those in the range 89° to 89.5°.
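
The straight-line example is a two-line calculation: for a prior that is flat in the slope b, the prior probability of a range of angles is proportional to the difference of the tangents at its ends. The function name below is hypothetical.

```python
import math

def slope_prior_mass(theta_lo_deg, theta_hi_deg):
    """Unnormalised prior mass of an angular range for a prior flat in b = tan(theta)."""
    return math.tan(math.radians(theta_hi_deg)) - math.tan(math.radians(theta_lo_deg))

wide = slope_prior_mass(0.0, 89.0)      # 89 degrees of angle
narrow = slope_prior_mass(89.0, 89.5)   # half a degree of angle
print(f"0 to 89 degrees: {wide:.2f}, 89 to 89.5 degrees: {narrow:.2f}")
```

Both ranges span about 57.3 units of slope, so a prior that looks uninformative in b is strongly peaked towards steep lines when expressed in $\theta$.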

It should be realised that the results for Hypothesis Testing are more sensitive
to the choice of prior than in parameter determination. Thus in parameter
determination, sometimes a prior is used which is constant over a wide range of
$\nu$ and zero outside it. The resulting range for the parameter, as deduced from
its posterior, may well be insensitive to the range used, provided it includes
the region where the likelihood $L(\nu)$ is significant. For comparing
hypotheses, however, there can be parameters which occur in one hypothesis but
not the other. (An example of this is where $H_1$ corresponds to smooth
background plus a peak, while $H_0$ is just smooth background.) The widths of
such priors affect their normalisation, and hence also the Bayes factor (see
next Section) directly.

On the other hand, in searches for a new particle of unknown mass, the
Bayesian probability for the particle existing will depend on the range of the
prior used for the particle's mass - the wider the range, the lower the
normalisation and hence the lower the probability.^ At least qualitatively, this
resembles the effect of the LEE in the frequentist approach, where the
significance of a peak in a mass spectrum is diluted if the search extends over
a wide mass range (see section 6.5).



7.2 Bayes factor 

For each hypothesis we define $R_i = p_{post}/\pi_i$, where $p_{post}$ and
$\pi_i$ are respectively the posterior and prior probabilities for hypothesis i.
Thus $R_i$ is just the ratio of posterior and prior probabilities. Then the
Bayes factor for the two hypotheses $H_0$ and $H_1$ is $B_{01} = R_0/R_1$. If
the two hypotheses are both simple, then this is just the likelihood ratio. If
either is composite, the relevant integrals are required for $p_{post}$. A small
value of $B_{01}$ favours $H_1$.

Demortier[13] has drawn attention to the fact that it can be useful to calculate
the minimum Bayes factor[14]. This is defined as above, but with the extra
nuisance parameters of $H_1$ set at values that minimise $B_{01}$, i.e. they are
as favourable as possible for $H_1$. If even this value of $B_{01}$ suggests
that $H_1$ is not to be preferred, then it is a waste of time to investigate
further, since any choice of priors for the extra parameters cannot make
$B_{01}$ smaller.
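
A toy case shows both the dependence of the Bayes factor on the prior width and the role of the minimum Bayes factor. It is assumed here, purely for illustration, that a single measurement x is Gaussian with unit width about a true value mu, that $H_0$ has mu = 0, and that $H_1$ has mu uniformly distributed between 0 and a chosen width. The function names and numbers are hypothetical.

```python
from statistics import NormalDist

UNIT_GAUSSIAN = NormalDist()  # mean 0, width 1

def bayes_factor_01(x, width):
    """B01 for H0: mu = 0 against H1: mu uniform on [0, width]."""
    likelihood_h0 = UNIT_GAUSSIAN.pdf(x)
    # Average of the H1 likelihood over its prior: (1/width) * integral of pdf(x - mu).
    marginal_h1 = (UNIT_GAUSSIAN.cdf(x) - UNIT_GAUSSIAN.cdf(x - width)) / width
    return likelihood_h0 / marginal_h1

def minimum_bayes_factor_01(x):
    """B01 with mu set to the value most favourable to H1, i.e. mu = x."""
    return UNIT_GAUSSIAN.pdf(x) / UNIT_GAUSSIAN.pdf(0.0)

x_observed = 3.0
print("minimum B01:", minimum_bayes_factor_01(x_observed))
for width in (5.0, 10.0, 100.0):
    print(f"prior width {width:6.1f}: B01 = {bayes_factor_01(x_observed, width):.4f}")
```

In this toy case, once the prior range comfortably contains the data, widening it further increases $B_{01}$ in proportion, but no choice of prior can take $B_{01}$ below the minimum value.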

^This is an example of Occam's Razor, whereby a simpler hypothesis may be favoured over
a more complex one.

Table 3: Cost function. Typically the cost A assigned to a false discovery claim
would be larger than B, the cost for a failure to discover. There is zero cost for
making a correct decision.

                $H_0$ true                          $H_1$ true
Accept $H_0$    Correct choice. Cost = 0            Failure to discover. Cost = B
Reject $H_0$    False discovery claim. Cost = A     Correct choice. Cost = 0



7.3 Other Bayesian methods 

The Bayesian approach can be used in conjunction with Decision Theory, in
order to provide a procedure for choosing between two hypotheses. In addition
to any priors, a cost function has to be defined, which assigns a numerical
'cost' for each combination of the true hypothesis ($H_0$ or $H_1$), and the
possible decisions - see Table 3. The decision procedure is designed to minimise
the expected cost, as determined by the cost function and the expected
distribution of posterior probabilities for $H_0$ and $H_1$.
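
A minimal sketch of such a decision rule, for one observed data set, is given below. It assumes that the posterior probability of $H_0$ has already been obtained and that the two hypotheses are exhaustive, and it uses the costs A and B of Table 3; the function names and the numerical costs are hypothetical.

```python
def expected_costs(posterior_h0, cost_a, cost_b):
    """Expected cost of each decision, given the posterior probability of H0."""
    posterior_h1 = 1.0 - posterior_h0
    return {
        "accept H0": cost_b * posterior_h1,   # wrong only if H1 is true
        "reject H0": cost_a * posterior_h0,   # wrong only if H0 is true
    }

def decide(posterior_h0, cost_a, cost_b):
    """Choose the decision with the smaller expected cost."""
    costs = expected_costs(posterior_h0, cost_a, cost_b)
    return min(costs, key=costs.get)

for posterior_h0 in (0.5, 0.1, 0.01):
    print(posterior_h0, decide(posterior_h0, cost_a=10.0, cost_b=1.0))
```

With these costs $H_0$ is rejected only when its posterior probability falls below B/(A + B), about 0.09 here, so a larger cost for a false discovery claim translates directly into a more stringent requirement.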

Because of the problems of assigning realistic costs, and the use of priors in 
determining the posteriors for the hypotheses, there is little or no usage of this 
approach in Particle Physics searches for New Physics. 

Other Bayesian methods such as AIC, BIC, ... (Akaike Information Criterion,
Bayesian Information Criterion, ...) aim to provide approximations to
the Bayes factor which are easier to calculate. Given the powerful
computational facilities available nowadays, these methods are likely to
decrease in general usage. Again there is little or no experience of using them
in Particle Physics applications.

7.4 Why p is not equal to the likelihood ratio

There is sometimes discussion of why a likelihood ratio approach (or the Bayes 
factor, if there are nuisance parameters) can give a very different numerical 
answer to a p-value calculation. A reason some agreement might be expected is
that they are both addressing the question of whether there is evidence in the 
data for new physics. 

In fact they measure very different things. Thus po simply measures the 
consistency with the null hypothesis, without any regard to the degree of agree- 
ment with the alternative, while the likelihood ratio takes the alternative into 
account. There is thus no reason to expect them to bear any particular rela- 
tionship to each other. This can be illustrated by contours of constant values of 
the likelihood ratio r = Lq/Li on a po versus pi plot (see fig. [sjjb)). The fi gure 
is constructed by assuming that the pdfs for the two hypotheses Hq and Hi are 
given by Gaussian distributions, both of unit width. Then at constant pg, it is 
seen that the likelihood ratio can take a range of values, corresponding to the 
Gaussians having different separations. Thus with the Gaussian for $H_0$'s pdf
centred at zero, a measured value of 5.0 yields a $p_0$-value of
$3 \times 10^{-7}$, regardless of the position of the $H_1$ Gaussian. Such a
small p-value is usually taken as sufficient to reject $H_0$. When the centre
of the $H_1$ Gaussian is at $\mu_1 = 0$, the two Gaussian pdfs are identical,
and $r = 1$. With increasing $\mu_1$, $p_0$ of course remains constant, but $r$
at first decreases to a minimum when $\mu_1 = 5$, then increases through unity
when $\mu_1 = 10$ (i.e. the data is midway between the pdf peaks), and keeps
on rising with further increases of the separation of the pdfs. At that stage,
the data are more in agreement with $H_0$ than with $H_1$, despite the small
value of $p_0$.
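The behaviour described above can be checked numerically. The following is a
minimal sketch, assuming a one-sided p-value (the probability under the null
hypothesis of a measurement at least as large as the one observed) and
unit-width Gaussian pdfs centred at 0 for the null hypothesis and at mu1 for
the alternative. The function names are illustrative only.

```python
import math

def p0_value(x):
    """One-sided tail probability of observing x or larger under H0,
    where H0 is a unit-width Gaussian centred at zero."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def likelihood_ratio(x, mu1):
    """r = L0/L1 for unit-width Gaussians centred at 0 (H0) and mu1 (H1).
    The common normalisation factor cancels in the ratio."""
    return math.exp(-0.5 * x**2 + 0.5 * (x - mu1)**2)

x_measured = 5.0

# p0 depends only on the data and on H0, so it is computed once.
print(f"p0 = {p0_value(x_measured):.2e}")

# r changes as the H1 Gaussian is moved, although p0 stays fixed.
for mu1 in (0.0, 2.5, 5.0, 7.5, 10.0, 12.0):
    r = likelihood_ratio(x_measured, mu1)
    print(f"mu1 = {mu1:5.1f}   r = L0/L1 = {r:.3e}")
```

The scan gives r = 1 at mu1 = 0 and again at mu1 = 10, with a minimum of
exp(-12.5) at mu1 = 5, and r > 1 for larger separations, while the p-value
for the null hypothesis stays at about 3 x 10^-7 throughout.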

8 CONCLUSION 



Table 4: Comparison of Bayes and Frequentist approaches

                                    Bayes                               Frequentist

Basis of method                     Bayes Theorem -> posterior          Uses pdf for data, for fixed
                                    probability distribution            parameter values
Meaning of probability              Degree of belief                    Frequentist definition
Probability for parameters?         Yes                                 No, no, no
Needs prior?                        Yes                                 No
Choice of interval?                 Central, upper limit, shortest,...  Choice of ordering rule
Data used                           Only the data you have              Also other possible data
Needs ensemble of possible          No                                  Yes (but often not explicit)
experiments?
Obeys the Likelihood Principle?     Yes                                 No
Unphysical/empty ranges possible?   Excluded by prior                   Can occur
Final statement                     Posterior prob dist                 Param values for which data is
                                                                        likely
Do param ranges cover?              Regarded as unimportant             Built in
Include systematics                 Integrate over prior                Extend dimensionality of
                                                                        frequentist construction



We have seen how Bayesians and Frequentists differ fundamentally in the 
way they consider probability. This then affects the way they approach the 
topics of parameter determination, and of choosing between two hypotheses.
Table [4] provides a summary of the differences between the two approaches. 

A cynic's view of the two techniques is provided by the quotation: 

"Bayesians address the question everyone is interested in by using
assumptions no-one believes, while Frequentists use impeccable logic to deal
with an issue of no interest to anyone."

However, it is not necessary to be so negative, and for physics analyses at
CERN's LHC, the aim is, at least for determining parameters and setting upper
limits in searches for various new phenomena, to use both approaches; similar 
answers would strengthen confidence in the results, while differences suggest the 
need to understand them in terms of the somewhat different questions that the 
two approaches are asking. 

It thus seems that the old war between the two methodologies is subsiding, 
and that they can hopefully live together in fruitful cooperation. 